Support TsfileDataFrame #765

Open
ycycse wants to merge 4 commits into apache:develop from ycycse:tsdf

Conversation

@ycycse
Member

@ycycse ycycse commented Apr 2, 2026

This PR introduces TsFileDataFrame, which can read multiple TsFiles for model-training use through a dataframe-like interface.

https://apache-iotdb-project.feishu.cn/docx/SenJdxlbuoUS5Uxmq7jcOUzdnob?from=from_copylink

Build Wheel

Prerequisite: build the C++ library first, since the Python package depends on the shared library from cpp/target/build.

From the repo root:

mvn -P with-cpp clean install -DskipTests

cd python
python setup.py build_ext --inplace
python setup.py bdist_wheel

pip install dist/*.whl

Validation

  Basic dataset regression tests:

  pytest -q python/tests/test_tsfile_dataset.py

Basic Usage

Open one TsFile

  from tsfile import TsFileDataFrame

  tsdf = TsFileDataFrame("/path/to/file.tsfile")
  print(tsdf)

Open a shard directory

TsFileDataFrame will recursively collect all .tsfile files under the directory.

  from tsfile import TsFileDataFrame

  tsdf = TsFileDataFrame("/path/to/dataset_dir")
  print(len(tsdf))
  print(tsdf.list_timeseries()[:5])
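For illustration, the recursive shard discovery described above could be sketched with pathlib; the helper name is hypothetical and the real logic lives inside TsFileDataFrame:

```python
from pathlib import Path

def discover_tsfiles(root: str) -> list:
    """Recursively collect .tsfile paths under a directory, sorted for
    deterministic shard ordering (illustrative sketch only)."""
    return sorted(str(p) for p in Path(root).rglob("*.tsfile"))
```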

Filter by metadata

  filter_df = tsdf[tsdf['field'] == 'weather']
  print(filter_df)

Access one logical series

By index:

  series = tsdf[0]
  print(series)
  print(series.name)
  print(series.stats)
  print(series[0])
  print(series[:10])

By logical series path:

  series = tsdf["weather.device_a.temperature"]
  print(series.timestamps)

Build subset views

  subset = tsdf[:10]
  print(subset)

  subset2 = tsdf[[0, 3, 5]]
  print(subset2.list_timeseries())

Time-aligned multi-series query

  aligned = tsdf.loc[0:1000, [0, 1]]
  print(aligned)
  print(aligned.timestamps)
  print(aligned.values)
  print(aligned.series_names)

Or use logical series names:

  aligned = tsdf.loc[
      0:1000,
      ["weather.device_a.temperature", "weather.device_a.humidity"],
  ]

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new Python-side “dataframe-like” API for reading TsFile table-model data across multiple TsFile shards, enabling series discovery, per-series access, and time-aligned multi-series queries for model training workflows.

Changes:

  • Added TsFileSeriesReader to discover series (table + tags + field) and read data via Arrow batch queries.
  • Added TsFileDataFrame (plus Timeseries / AlignedTimeseries) to unify multiple TsFiles, merge overlapping series, and support .loc aligned queries.
  • Exported the new classes from python/tsfile/__init__.py.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File / Description
  • python/tsfile/tsfile_series_reader.py — New Arrow-based series reader with metadata discovery, timestamp caching, and range/time reads.
  • python/tsfile/tsfile_dataframe.py — New multi-file dataframe-like abstraction with series selection, merged metadata, and .loc alignment.
  • python/tsfile/__init__.py — Re-export TsFileDataFrame, Timeseries, and AlignedTimeseries from the package.


Comment on lines +402 to +407
if arrow_table.num_rows > 0:
    ts_list.append(arrow_table.column('time').to_numpy())
    for fc in field_columns:
        field_lists[fc].append(
            arrow_table.column(fc).to_numpy().astype(np.float64)
        )

Copilot AI Apr 2, 2026


In _read_arrow, values are always converted via to_numpy().astype(np.float64). This will fail for FIELD columns that are not numeric (e.g., STRING/TEXT/BLOB/DATE) and can also fail when the Arrow array contains nulls (writing code allows NaN values, which become nulls on read). Consider filtering field_columns up-front to numeric TSDataType only, and/or using Arrow casting + null-to-NaN handling before converting to NumPy (and keeping non-numeric columns out of this reader).
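A minimal sketch of the suggested fix, assuming columns arrive as Python sequences with None for nulls; the helper names are illustrative, not the PR's actual API:

```python
import numpy as np

NUMERIC_KINDS = ("i", "u", "f")  # signed int, unsigned int, float

def is_numeric_column(values) -> bool:
    """Check whether a column's values have a numeric dtype (sketch)."""
    return np.asarray(values).dtype.kind in NUMERIC_KINDS

def to_float64_with_nan(values) -> np.ndarray:
    """Convert a column to float64, mapping None/null entries to NaN
    instead of raising during the cast."""
    return np.array(
        [np.nan if v is None else float(v) for v in values],
        dtype=np.float64,
    )
```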

Contributor


+1

Member Author


Done. The implementation has since been refactored into python/tsfile/dataset/reader.py. The dataset reader now only includes numeric FIELD columns during discovery, and non-numeric fields are excluded from the dataset path. The current read path also keeps an explicit numeric-compatibility check when converting Arrow arrays.

Comment on lines +280 to +285
if series_path in self._series_data_cache:
    return self._series_data_cache[series_path][start:end].tolist()

info = self.series_info[series_path]
timestamps = self._timestamps_cache[series_path]


Copilot AI Apr 2, 2026


read_series_range() indexes timestamps[start] and timestamps[end - 1] without validating start/end (including start == end, negative indices, or end > length). This currently raises confusing IndexErrors and makes empty slices impossible. Add explicit range validation and return an empty result for start >= end to match the docstring’s [start, end) semantics.

Suggested change

-if series_path in self._series_data_cache:
-    return self._series_data_cache[series_path][start:end].tolist()
-info = self.series_info[series_path]
-timestamps = self._timestamps_cache[series_path]
+if start < 0 or end < 0:
+    raise ValueError("start and end indices must be non-negative")
+# Handle cached series data first, using [start, end) semantics
+if series_path in self._series_data_cache:
+    data = self._series_data_cache[series_path]
+    length = len(data)
+    if start >= length:
+        return []
+    if end > length:
+        end = length
+    if start >= end:
+        return []
+    return data[start:end].tolist()
+info = self.series_info[series_path]
+timestamps = self._timestamps_cache[series_path]
+length = len(timestamps)
+if start >= length:
+    return []
+if end > length:
+    end = length
+if start >= end:
+    return []

Member Author


done

Comment on lines +171 to +183
# Multiple tag columns: use structured approach
# Convert to list of tuples for grouping
n = len(all_times)
tag_tuples = [
    tuple(all_tags[tc][i] for tc in tag_columns)
    for i in range(n)
]
unique_tuples = list(dict.fromkeys(tag_tuples))
for ut in unique_tuples:
    mask = np.array([t == ut for t in tag_tuples], dtype=bool)
    self._register_tag_group(
        table_name, tag_columns, ut,
        field_columns, all_times[mask]

Copilot AI Apr 2, 2026


The multi-tag grouping path builds tag_tuples in Python and then, for each unique tuple, constructs a full boolean mask with a Python loop ([t == ut for t in tag_tuples]). This is O(n * unique_tags) and will become a bottleneck for large tables. Consider using a vectorized approach (e.g., structured NumPy array + np.unique(..., return_inverse=True) or pandas/groupby) to compute groups and indices in (near) linear time.

Suggested change

-# Multiple tag columns: use structured approach
-# Convert to list of tuples for grouping
-n = len(all_times)
-tag_tuples = [
-    tuple(all_tags[tc][i] for tc in tag_columns)
-    for i in range(n)
-]
-unique_tuples = list(dict.fromkeys(tag_tuples))
-for ut in unique_tuples:
-    mask = np.array([t == ut for t in tag_tuples], dtype=bool)
-    self._register_tag_group(
-        table_name, tag_columns, ut,
-        field_columns, all_times[mask]
+# Multiple tag columns: use structured NumPy array for grouping
+n = len(all_times)
+# Build a structured array with one field per tag column
+dtype = [(tc, all_tags[tc].dtype) for tc in tag_columns]
+structured_tags = np.empty(n, dtype=dtype)
+for tc in tag_columns:
+    structured_tags[tc] = all_tags[tc]
+# Find unique tag combinations and an inverse index for grouping
+unique_vals, inverse = np.unique(
+    structured_tags, return_inverse=True
+)
+# Group rows by unique tag combination using the inverse index
+for group_id, ut in enumerate(unique_vals):
+    mask = inverse == group_id
+    tag_values = tuple(ut[tc] for tc in tag_columns)
+    self._register_tag_group(
+        table_name,
+        tag_columns,
+        tag_values,
+        field_columns,
+        all_times[mask],

Contributor


For tables with multiple tag columns, grouping is done by:

  1. Building a Python list of tuples (line 174-177): O(n)
  2. For each unique tuple, scanning the entire list to build a boolean mask (line 180): O(n * k) where k = number of unique groups

For a table with 1M rows and 1000 tag groups, this becomes ~1 billion comparisons in pure Python.

Member Author


Done. The implementation has moved to python/tsfile/dataset/reader.py. The pure-Python tuple/mask grouping path was replaced with a vectorized NumPy grouping implementation based on structured arrays and np.unique(..., return_inverse=True).

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.



Comment on lines +128 to +139
def show(self, max_rows: Optional[int] = None):
    """Print formatted table with configurable row limit.

    Args:
        max_rows: Maximum rows to display. None for all rows.
    """
    n_rows, n_cols = self.values.shape
    if n_rows == 0:
        print(f"AlignedTimeseries(0 rows, {n_cols} series)")
        return
    ts_strs, ts_width, col_widths, val_strs = self._build_display()
    print(self._format_rows(ts_strs, ts_width, col_widths, val_strs, max_rows))
Contributor


Pass max_rows to _build_display?

Member Author


done. The implementation has moved to python/tsfile/dataset/formatting.py and python/tsfile/dataset/timeseries.py. All formatting now goes through the shared display logic.

n_rows = len(ts_strs)
n_cols = len(col_widths)

header_parts = ['timestamp'.rjust(ts_width)]
Contributor


The column is a datetime string, so calling it timestamp is misleading.
Consider time or datetime instead.

Member Author


Done. The implementation has moved to python/tsfile/dataset/formatting.py. The aligned display header now uses time instead of timestamp.

col_widths = []
val_strs = []
for col_idx in range(n_cols):
    col_name = self.series_names[col_idx] if col_idx < len(self.series_names) else f'col_{col_idx}'
Contributor


In what case will col_xxx be used?

Member Author


The implementation has moved to python/tsfile/dataset/formatting.py. This is now only a defensive fallback for mismatched metadata during formatting. In normal public paths it should not be used, since series_names and value columns are expected to match.

Comment on lines +74 to +83
def _build_display(self):
    """Pre-compute string representations for display."""
    n_rows, n_cols = self.values.shape
    ts_strs = [_format_timestamp(int(t)) for t in self.timestamps]
    ts_width = max((len(s) for s in ts_strs), default=0)
    ts_width = max(ts_width, len('timestamp'))

    col_widths = []
    val_strs = []
    for col_idx in range(n_cols):
Contributor


Is it possible to concatenate timestamps and values, then use ndarray's print method, if any?

Member Author


The implementation has moved to python/tsfile/dataset/formatting.py. I kept the custom formatter intentionally, because it gives us stable head/tail truncation, NaN rendering, and consistent column formatting, which ndarray printing does not provide well for this API.

Comment on lines +260 to +262
# Deduplicate by timestamp (keep first occurrence)
_, unique_idx = np.unique(merged_ts, return_index=True)
return merged_ts[unique_idx], merged_vals[unique_idx]
Contributor


Why keep the first occurrence?

Member Author


Done. The implementation has moved to python/tsfile/dataset/merge.py. The current merge policy rejects duplicate timestamps across shards instead of keeping the first occurrence.
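A sketch of such a duplicate-rejecting merge, under the assumption that shard timestamps arrive as NumPy arrays; the function name is hypothetical, not the merge.py API:

```python
import numpy as np

def merge_shard_timestamps(ts_a: np.ndarray, ts_b: np.ndarray) -> np.ndarray:
    """Merge timestamps from two shards, rejecting duplicates across
    shards instead of silently keeping the first occurrence (sketch)."""
    merged = np.concatenate([ts_a, ts_b])
    uniq, counts = np.unique(merged, return_counts=True)
    if np.any(counts > 1):
        dup = uniq[counts > 1][0]
        raise ValueError(f"duplicate timestamp across shards: {dup}")
    return uniq  # np.unique already returns a sorted array
```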

Comment on lines +94 to +105
table_schemas = self._reader.get_all_table_schemas()
if not table_schemas:
    raise ValueError("No tables found in TsFile")

self.series_paths = []
table_names = list(table_schemas.keys())

# Progress tracking
total_rows = 0

for ti, table_name in enumerate(table_names):
    table_schema = self._reader.get_table_schema(table_name)
Contributor


Why not use table_schemas?

Member Author


done.

Comment on lines +112 to +113
if col_name.lower() == 'time':
    continue
Contributor


Also check column category?

Member Author


done

Comment on lines +122 to +130
# Query TAG columns + first FIELD column to discover groups and timestamps
query_cols = tag_columns + [field_columns[0]]

time_arrays = []
tag_arrays = {tc: [] for tc in tag_columns}

with self._reader.query_table_batch(
    table_name, query_cols, batch_size=65536
) as rs:
Contributor


A TsFile may not actually contain a column's data even if it is in the schema.
Double-check if timestamps can be read in this scenario.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing tags repeatedly puts heavy pressure on memory.
Consider fetching all DeviceIds first and deriving the tag values from them.

Member Author


Done. The implementation has moved to python/tsfile/dataset/metadata.py, python/tsfile/dataset/reader.py, and python/tsfile/dataset/dataframe.py. Shared metadata is now centered on table/device entries, and logical series are represented by lightweight refs instead of repeated per-series tag dictionaries.


Comment on lines +419 to +429
def cache_series_data(self, series_path: str):
    """Pre-load series data into memory cache.

    Args:
        series_path: Time series path.
    """
    if series_path not in self.series_info:
        raise ValueError(f"Series not found: {series_path}")
    if series_path not in self._series_data_cache:
        data = self.read_series(series_path)
        self._series_data_cache[series_path] = np.array(data, dtype=np.float32)
Contributor


The method is not used? And there does not seem to be any memory control.

Member Author


It's an old method. We don't need it now.

    raise ValueError(f"Series not found: {series_path}")
if series_path not in self._series_data_cache:
    data = self.read_series(series_path)
    self._series_data_cache[series_path] = np.array(data, dtype=np.float32)
Contributor


read_series returns float64 values (via _read_arrow which casts to np.float64), but cache_series_data stores them as float32. When read_series is later called on a cached series, it returns float32 values silently, which can lose precision.
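The precision loss described here is easy to demonstrate with a float32 round-trip, using only the standard library:

```python
import struct

# float64 keeps a 53-bit significand; float32 only 24 bits, so integers
# above 2**24 do not survive a float32 round-trip.
exact = 16777217.0  # 2**24 + 1, exactly representable in float64
round_tripped = struct.unpack("f", struct.pack("f", exact))[0]
print(round_tripped)  # silently rounds to 16777216.0
```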

Member Author


Done. The old cached-series path was removed during the refactor, so this precision inconsistency no longer exists.

Comment on lines +316 to +317
if idx < 0 or idx >= len(self._df._series_list):
    raise IndexError(f"Series index {idx} out of range")
Contributor


Negative indices are not normalized (no idx = length + idx like other __getitem__ methods in this file). A user passing df.loc[:, [-1]] would get an IndexError instead of the last series.

Member Author


Done. The implementation has moved to python/tsfile/dataset/dataframe.py. Negative indices in .loc series selection are normalized now.
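The Python-style normalization being suggested is roughly the following; the helper is illustrative, not the actual dataframe code:

```python
def normalize_index(idx: int, length: int) -> int:
    """Normalize a possibly-negative sequence index, so that -1 refers
    to the last series, matching the other __getitem__ methods."""
    if idx < 0:
        idx += length
    if idx < 0 or idx >= length:
        raise IndexError(f"Series index {idx} out of range")
    return idx
```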

Comment on lines +567 to +568
merged = np.unique(np.concatenate(all_ts))
merged.sort()
Contributor


np.unique already returns a sorted array. The .sort() call is redundant.

Member Author


done

Comment on lines +108 to +111
if max_rows is None or n_rows <= max_rows:
    show_rows = list(range(n_rows))
else:
    show_rows = list(range(max_rows))
Contributor


When max_rows is exceeded, only the first max_rows rows are shown. In contrast, TsFileDataFrame._format_table shows head + "..." + tail. This inconsistency may confuse users expecting similar behavior from both display methods.

Member Author


done. The aligned display now uses the same head + ... + tail truncation style.
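One possible head + "..." + tail index selection is sketched below; this is a hypothetical helper, not the actual formatter:

```python
def truncated_rows(n_rows, max_rows):
    """Pick row indices to display as head + '...' + tail when the row
    count exceeds max_rows (sketch of the truncation policy)."""
    if max_rows is None or n_rows <= max_rows:
        return list(range(n_rows))
    head = (max_rows + 1) // 2       # show slightly more rows at the top
    tail = max_rows - head
    return list(range(head)) + ["..."] + list(range(n_rows - tail, n_rows))
```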

Comment on lines +837 to +846
def close(self):
    """Close all underlying readers.

    No-op for subset views (they don't own readers).
    """
    if self._is_view:
        return
    for reader in self._readers.values():
        reader.close()
    self._readers.clear()
Contributor


After close(), _readers is cleared but _series_map still holds references to closed readers. Any subsequent data access (e.g., tsdf[0][0]) will attempt to read from a closed reader, producing an unclear error.

Recommendation: Either invalidate _series_map too, or set a _closed flag and check it in data-access paths.
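A minimal sketch of the _closed-flag approach, with illustrative class and method names:

```python
class ClosableDataFrame:
    """Track a _closed flag and fail fast on access after close(),
    instead of surfacing an unclear error from a closed reader."""

    def __init__(self):
        self._closed = False

    def close(self):
        self._closed = True

    def _check_open(self):
        if self._closed:
            raise ValueError("TsFileDataFrame is closed")

    def read(self):
        self._check_open()
        return "data"
```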
